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Abstract 

Background: The ancestries of genes form gene trees which do not necessarily have the same topology as the 
species tree due to incomplete lineage sorting. Available algorithms determining the probability of a gene tree given 
a species tree require exponential computational runtime. 

Results: In this paper, we provide a polynomial time algorithm to calculate the probability of a ranked gene tree 
topology for a given species tree, where a ranked tree topology is a tree topology with the internal vertices being 
ordered. The probability of a gene tree topology can thus be calculated in polynomial time if the number of orderings 
of the internal vertices is a polynomial number. However, the complexity of calculating the probability of a gene tree 
topology with an exponential number of rankings for a given species tree remains unknown. 

Conclusions: Polynomial algorithms for calculating ranked gene tree probabilities may become useful in developing 
methodology to infer species trees based on a collection of gene trees, leading to a more accurate reconstruction of 
ancestral species relationships. 

Keywords: Incomplete lineage sorting, Coalescent history, Anomalous gene tree, Dynamic programming 



Background 

Phylogenetic reconstruction methods aim to infer the 
species phylogeny which gave rise to a group of extant 
species. Typically, this species phylogeny is obtained based 
on genetic data from representative individuals of each 
extant species. The ancestries of genes at different loci 
form gene trees which do not necessarily have the same 
topology as the species tree. Gene tree topologies and 
species tree topologies might be different due to such 
phenomena as incomplete lineage sorting, gene duplica- 
tion, recombination within gene loci, and horizontal gene 
transfer [1]. In this paper, we focus on incomplete lineage 
sorting as the mechanism for incongruence of gene tree 
and species tree topologies, in which two gene lineages 
do not coalesce in the most recent population ancestral 
to the individuals from which the genes were sampled. As 
an example, the lineages sampled from species A and B in 
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Figure lb do not coalesce until the population ancestral 
to species A, B, and C, thus allowing the B and C lineages 
in the gene tree to have a more recent common ancestor 
than lineages A and B. 

Given a fixed species tree, and assuming the gene tree 
evolved under the multi-species coalescent [1], the most 
probable gene tree topology can have a different topology 
from that of the species tree. Such a gene tree topology is 
called an anomalous gene tree. In fact, for every species 
tree topology with at least 5 leaves, we can choose edge 
lengths in the species tree topology such that anoma- 
lous gene trees exist [2]. This implies that the gene tree 
topology appearing most often when considering differ- 
ent genes might not agree with the species tree topology, 
thus we cannot use a simple majority-heuristic to infer the 
species tree from a collection of gene trees. Instead we 
need statistical tools rather than majority rule heuristics 
for inferring the species tree based on gene trees. 

Current methods for inferring species trees from gene 
trees in this setting can be divided into topology- 
based and genealogy-based methods, in which the input 
for a reconstruction algorithm accepts either gene tree 
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Figure 1 In (a)-(d) the ranked species tree topology is ({(A,B) i , C) 2 , (a) The ranked gene tree matches the ranked species tree, 

(b) The (ranked or unranked) gene tree does not match the species tree, and there is an incomplete lineage sorting event (a deep coalescence) 
because the lineages from species A and B fail to coalesce more recently than 52. (c) The gene tree and species tree have the same unranked 
topology but have different ranked topologies, as Dand E coalesce in the gene tree more recently than A and 8, while A and 8 is the most recent 
divergence in the species tree. The gene tree in (c) has ranked topology (((A 8)3,02. (D, £)4)i- In (c), there are no incomplete lineage sorting events 
(no deep coalescences); however, there is an extra lineage at time S3 which leads to the gene tree and species tree having different rankings. In (c), 
all coalescences occur in the most recent possible interval consistent with the ranked gene tree, and we have l\ = 2, £2 = 3, £3 = 5, £4 = 5, and 
gi = 2, g2 = 3, 93 = 5, <j4 = 5. (d) A gene tree with the same ranked topology as the gene tree in (c) but with coalescences occurring in different 
intervals. 



topologies or genealogies, i.e., gene trees with branch 
lengths (coalescence times). Topology-based methods 
include Minimize Deep Coalescence (MDC) [3,4], STAR 
[5], STELLS [6], rooted triple consensus [7] and other 
consensus and supertree methods [8,9]. Genealogy-based 
methods include Bayesian and likelihood methods such 
as BEST, *BEAST, and STEM [10-12] and clustering and 
distance-based methods [5,13-15]. Possible pros and cons 
of the two approaches are that topology-based meth- 
ods can be computationally faster and less sensitive to 
errors in estimating gene trees (and gene tree branch 
lengths) from sequence data [16], while methods that use 
coalescence times, particularly using Bayesian modelling, 
can be the most accurate when model assumptions are 
correct [17]. 

Another possibility that has been so far unexplored in 
methods for inferring species trees from gene trees is 
to use ranked gene trees, in which the temporal order 



of the nodes of the gene tree (the coalescence times) is 
used, but not the continuous-valued branch lengths. This 
approach might therefore be intermediate between purely 
topology-based methods and genealogy-based methods. 
By preserving more of the temporal information in the 
gene tree nodes, the hope is to develop methods that are 
more powerful than purely topology-based methods and 
that are still computationally efficient and robust to errors 
in estimating gene trees and gene tree branch lengths from 
sequence data. 

In [18], a first step toward developing methods that use 
ranked gene trees for inferring species trees was taken by 
providing formulae to calculate the probability of a ranked 
gene tree given a species tree. The previous work, how- 
ever, was based on an exponential enumeration of what 
were called ranked coalescent histories and did not pro- 
vide an algorithm for computing some of the key terms 
in the probability of individual ranked histories. In this 
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paper, we improve this previous (computationally ineffi- 
cient) approach, by providing a method for computing 
probabilities of ranked gene trees given species trees 
which is polynomial in the number of leaves using a 
dynamic programming approach. 

Methods for computing probabilities of ranked gene 
trees efficiently may also be of interest in the context of 
computing probabilities of unranked gene trees, partic- 
ularly because no polynomial time algorithm has been 
found for calculating the probability of a gene tree topol- 
ogy given a species tree under the multispecies coalescent 
[6,19-21]. The probability of an unranked gene tree topol- 
ogy can be obtained by summing over all ranked gene tree 
topologies with the same topology. Thus, for unranked 
gene trees with particular shapes where the number of 
rankings increases in polynomial time, using ranked gene 
trees can potentially increase the speed of computing 
probabilities of unranked gene trees as well. We note that 
a completely unbalanced gene tree has only one ranking, 
while the number of rankings can be exponential in the 
number of leaves when gene trees become more balanced. 
Thus, our approach for calculating unranked gene tree 
probabilities will be most useful for less balanced ranked 
gene trees. 

The bulk of the paper consists of the derivation of the 
polynomial time method for computing ranked gene tree 
probabilities. The algorithm is summarized in section 'An 
algorithm! This is followed by a discussion of applications 
to computing probabilities of unranked gene tree topolo- 
gies and to inferring ranked species trees under maximum 
likelihood and a modification to the MDC criterion. 

Calculating the probability of a ranked gene tree 
topology 

In the following, we will derive the probability of a 
ranked gene tree topology given a species tree, P[Sf | 71. 
Equations (1, 2, 3, 4, 8, 10) allow the calculation of 
P[ | 71 in time 0(« 5 ). The model giving rise to the gene 
tree is the multi-species coalescent with constant popu- 
lation sizes [1]. Each species consists of a population of 
constant size where lineages merge according to the coa- 
lescent. Thus, lineages from two different species may 
coalesce any time previous to the split of the two species. 

We begin with some notation, which is also summarized 
in Table 1. Let time be 0 today and increasing going into 
the past. Let T be a species tree with n species, and thus 
n — 1 speciation events (denoted by 1, ...,« — 1) occur- 
ring at times Si > • • • > s n -\. Denote the interval between 
speciation event i — 1 and speciation event i by t,-, see 
Figure 1. 

Let be a ranked gene tree topology. It is convenient to 
use the same labels for the leaves of and of T. This is a 
slight abuse of notation, as leaf A of T refers to a popula- 
tion (or species), and A of <S refers to a gene sampled from 



Table 1 Notation used in the paper 

Symbol Meaning 

r 

n 

Si 

n 
U 
rr)i 

% 
9i 

y<> 

Ui 



&(y),8(u) The set of leaves descended from a node of the 

species tree or gene tree, respectively 

lca(u) For a node u of the gene tree, the node y of the 

species tree with largest rank such that S(u) C S(y) 

r (y) For a node y with rank / on the species tree, we denote 

r(y) = tj (the interval immediately above y) 

Xij The overall coalescence rate in interval X; immediately 

preceding (backwards in time) the j'th coalescence 

Hk Number of sequences of coalescences above the root 

of the species tree starting with k lineages 

fi The joint density of coalescence times in interval x; 

population A. We denote the nodes of Sf (which are coa- 
lescence events) by Mi, ... , w„_i, where node Uj has rank /, 
and where higher rank indicates a more recent coales- 
cence. A ranked tree topology can be notated similarly to 
Newick notation, putting the rank as a subscript for each 
node, see also Figure 1. 

Let ^iXi be part of a ranked gene tree evolving on a 
species tree between time s; and time 0 (i.e. the present). 

consists of ii gene tree lineages at speciation time S; 
and the coalescent history of in time interval (0,5,) 
is consistent with the ranked gene tree Sf . Let gi be the 
minimum number of lineages required in the ranked gene 
tree at time S; such that Sf can be embedded into the 
species tree T. Note that n > li > gi > i. Next we 
provide a dynamic programming approach for calculat- 
ing the probability of a ranked gene tree given a species 
tree. An efficient way to determine the required quantities 
gi, . . . ,g n -\ is provided in Section 'Calculation of gi and 

Essentially, in our approach, we traverse the intervals 
between speciation events going back in time, r„_i, . . . , t2 



Species tree with real-valued divergence times 

Ranked gene tree (real-valued coalescence times not 
specified) 

The number of leaves of 7~andSf 

Speciation times, with Si > ■ ■ ■ > s n _i , let s 0 = °o 

Intervals between speciation times, x; =[S;,s,_i) 

The number of gene tree lineages at times,- 

The number of coalescence events in interval r, 

The ranked gene tree observed from time 0 to time s, 

The minimum number of gene tree lineages at times, 

Population z in interval r,- in beaded tree 

Internal node (coalescence) with rank / in the gene 
tree, U] is most ancient, u n _i is the most recent 

The number of lineages available for coalescence in 
population y, z just after the y'th coalescence (consid- 
ered forward in time) in interval Xf, /c,,o,z is the number 
of lineages "exiting" at time s/_i 

The set of leaves descended from a node of the 
species tree or gene tree, respectively 

For a node u of the gene tree, the node y of the 
species tree with largest rank such that S(u) C S(y) 

For a node y with rank ion the species tree, we denote 
r(y) = Xj (the interval immediately above y) 

The overall coalescence rate in interval r,- immediately 
preceding (backwards in time) the j'th coalescence 

Number of sequences of coalescences above the root 
of the species tree starting with k lineages 

The joint density of coalescence times in interval r, 
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(formalized in Theorem 2), and calculate the probability 
of the appropriate coalescent events occurring in inter- 
val T/ based on how many coalescent events happened 
in the later intervals r;+i, . . . , r M _i (Theorem 3). Finally 
with Theorem 1, we account for the most ancestral time 
interval x\. 

Theorem 1. The probability of a ranked gene tree given 
a species tree is, 



h=gi 

where 

H tl =h\ (£i-l)! /2 £l - J 



(1) 



(2) 



is the probability for the coalescences above the root 
appearing in the right order [22]. 

For precalculated P^i,^ \71 Hi = 2, . . . , n) the com- 
plexity of calculating P[Sf|T] is thus 0(n). Next, we 
will provide a recursive way to calculate Pt^XfjT] for 
l\ = 2, . . . , n in polynomial time, thus P[£f | T] can be 
calculated in polynomial time. 

Theorem 2. The probability PtS^JT] can be calcu- 
lated for all i recursively (with k > gi), 

n 

£; + l=max(£;,g; + i) 

(3) 

with 

V[%-l,n\T]=h 

The complexity of calculating P[%|T] for l\ = 
2, . . . , n is 0(« 3 ), given we knowY[% t ii I ^i+i,e i+ i < 71 for all 
i> ^ii ti+\- 

Proof. At the time of the most recent speciation event, 
5„_i, we have n lineages with probability 1, which is the 
initial value of the recursion. Calculating P[S^, |T] for 
i < n — 1 can be done in the following way, 

n 

£ P[Sf«„Sfo-w m IT] 

i i+1 =max(li,gi + i) 
n 

£ P[ %,u - 71 vm+i,e i+l \71 . 

l i+l =max(li,g i+l ) 

Suppose P[^, |^+i^ i+1 ,71 is known. Given we calcu- 
lated the probability P[ %+Ui+i \71 for U+\ = i + 2,...,n, 



then calculating P[S^,|T] for ti = i + l,...,n requires 
°(E"=i/) = 0 (("~2 +1 )) calculations. Summing up 
over i = 1, ...,«— 1 yields a complexity of O (Yl" = 2 Q)) = 
0(("f))=0(n^). □ 

It remains to determine W[&i-i,i,_ 1 \&w l tT]. Note that 
during the interval t/, we have z branches in the species 
tree. Let w?/ be the number of coalescent events in r,-, so 
mi = — Let the number of lineages on branch z 
just after the jth coalescent event (going forward in time) 
in Ti be fc,y jZ . Calculation of /Cy iZ can be done efficiently as 
shown in Section 'Calculation ofg; and Ar i)/)Z ! 



Theorem 3. We /jave, 



(4) 



w/zere A« v = E^i Cf ) and g) := 0. 

Proof. The density for the coalescence events in interval 
T/ can be obtained by considering the waiting time to the 
"next" coalescent event (going backwards in time) as being 
due to competing exponentials in the different branches, 
where the coalescence rate within branch z is ( k '£ z ). Thus, 
the waiting time until the next coalescent event has rate 

hj = eu cn 

We denote the time between the ;th and (j + l)st coa- 
lescent event as Vi, where vq is the time between s,_i and 
the first (least recent) coalescent event in r; and with v m 
being the time between s/ and coalescent event w;. 

The density for the coalescent events in the interval T; 
is [18], 

/i(vo, v l ,...,v mi ) = e~ ^=o ( k 'f>i 

It remains to integrate over v, for which we distinguish 
between case (i) k i>0 = 0, and case (ii) A /;0 > 0. 

Case (i): If kya = 0 (which occurs if = i, i.e., all 
lineages within each population coalesce), then we rewrite 
/ias, 



fi(v 0 ,vi,...,v mi ) = 



nmi i 
;=1 A V e 



(5) 



Using the fact that the integral of the numerator of 
Equation (5) is a hypoexponential distribution based on 
the sum of mi exponential random variables [23] (with 
density functions \ije~ Xi -' v > ', j = 1, . . . , m,-), the probability 
of the coalescent events in the interval is the cumulative 
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distribution function of the hypoexponential distribution 
evaluated at s,_i — S; = Ylj=Q v '~ Thus, with Xy < A.y+1, 



P^-u.-J^T] 



1 ^i, e -*y(s<-i-*f) 



(6) 



where the second line follows because — Xy = A^o — \y. 
Case (ii): If X i>0 > 0, then we rewrite/- as, 



fi(v 0 ,v 1 ,...,v m ) = 



llj=o Kj 



(7) 



For integrating we use the fact that the integral 
of the numerator in Equation (7) is the convolution of 
mi + 1 exponential random variables with parameters 
Xi,o, ■ ■ ■ ,X- hmi , which is the hypoexponential distribution. 
Now, since < A.y+i, we observe, using the probability 
density function of the hypoexponential distribution, 



= M v 0,vi,---,v mi ) dv 

Jv 

e -A.y(s/_i-ij) 



which is the same expression as for the A !; o = 0 case (6). 
Note that for case (i) we made use of the cumulative distri- 
bution function of the hypoexponential distribution, while 
for case (ii) we made use of the density function of the 
hypoexponential distribution. Both cases yield the same 
final expression for P[ ty-i,^ \&M t ,T], which establishes 
the proof. □ 

Corollary 4. The probabilities ^\ ( Si-\,i i _-^i,i i ,T\ for 
all possible i, mi and ii (recall that mi = ii — li-\) are 
calculated in 0(w 5 ), given all Xij. 

Proof. For a fixed i, mi and li, we require O(m^) calcu- 
lations to evaluate P[^_i^,_i |S% ; , T\. We need to deter- 
mine V^i-iXi-i \^i,tp T\ f° r a U possible i, m; and First, 
we observe that i < < n, and thus for a fixed ii, we 
have, 0 < mi < It — i. Second, i < ii < n. And third, 



2 < i < n — 1. Thus, the number of calculations needed to 
calculate T~\ f° r a ^ possible i, mi and is, 

(«— 1 n li—i \ / n— 1 « \ 

E E E^H EEft-) 3 
i=2 l t =i+\ m i= 0 J \ i=2 l t =i+\ ) 

(n-1 
i=2 

= 0(« 5 ). 

□ 

Corollary 5. The quantities Xy can be calculated for all 
possible i, mi, ii and j in 0(n 5 ), given all kij tZ . 

Proof. For a fixed i, mi, ii and /', we require 0(0 calcu- 
lations to evaluate Xij. As j = 0, . . . , m,-, with the same 
arguments as in Corollary 4, we obtain, 

(n-1 n ti—i mi \ (n-1 n li—i \ 

E E EE'H D E E*< 

(n—X n \ 

(n-1 
E'(«-o 3 
j=2 



0(« 5 ). 



□ 



We note that the terms P[^-u j _ 1 |^ i ,'T] are analo- 
gous to the functions defined in [24,25], which give the 
probability that i lineages coalesce into / within time t in 
a single population and are used extensively in computing 
probabilities related to unranked gene trees [6,19,26,27]. 
In particular, if only one population, say z*, has coales- 
cence events, then we have r 

_ gli+iM s i ~ s i+l) Uz^z*Sk iAz ,k iAz (Si - S i+ i) 

~ Yl{' +1 ~ li { li+1 ~ k+1 ) 

a product of gij functions with the denominator counting 
the number of sequences in which m; coalescences could 
have occurred. The terms V[ < ^i-i,e i _ 1 ffift, T] allow for the 
coalescences to occur in separate populations, however, 
and are constrained by the ranking of the gene tree. For 
example, in interval T3 of Figure lc, there are two coales- 
cences which occur in different populations. If the ranking 
of the gene tree were not important, the branches could be 
considered independent, and the probability of this event 
would beg2,i( s 2 — Sz)g2,i(s2 — S3). However, the gene tree 
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ranking constrains the coalescence of A and B to be less 
recent than that of D and E, so the probability for events 
in this interval is, r 

P[SM%5,T] =[£ 2 ,1 (52 " S 3 )] 2 A 

We illustrate that we get the same result from Theorem 3: 
there are two coalescence events in interval T3, so we use 
j = 0, 1, 2, and calculate 




Thus, Equation (4) from Theorem 3 evaluates to 

e -0(s 2 -S 3 ) g-l(S2-«3) g-2(S2-S3) 

(2 - 0)(1 - 0) + (0 - 1)(2 - 1) + (0 - 2)(1 - 2) 

_ I _ g -(S2-S3) _|_ i e -2(s 2 -S3) 

2 2 

= i(i-«-<«->) 2 

= [ffi.ite-«>l 2 /2. 

Remark 6. 27je probability of a gene tree topology is 
the sum of the probabilities of each ranked gene tree 
with the given topology. A given tree topology has (n — 
1)! /Y["=i( c i ~ 1) rankings, where ci is the number of 
descendant leaves of interior vertex i. A proof can be found 
in [28]. For a completely balanced tree on n = 2 k leaves, 
the number of rankings grows faster than polynomial: the 
numerator can be approximated by, 

w!» */27tn(n/e)", 

and the denominator can be approximated by, 

n-l k 

Y[( Ci - 1) = 11(2' - l)«/2< w n k = « l0 & B , 
(=1 (=i 

showing that the ratio grows faster than polynomial in n. 

Calculation of q, and /c,y iZ 
Calculation ofg. 

If T and £f have the same ranked topology, theng; = f + 1. 
In general, to compute gi, we let lca(w/) be the least com- 
mon ancestor node on the species tree for a node Uj on 
the ranked gene tree - i.e., the node with the largest rank 
on the species tree which is ancestral to all species repre- 
sented in Uj. For a node y on the species tree, let T (y) be the 
interval immediately above y. For example, in Figure lc, 
T(lca(M4,)) = T3 where M4 is the gene tree node with rank 



4 — the node ancestral to D and E only. In order to com- 
pute gi, we count the number of gene tree nodes which 
may occur closer to the present than Si. These are precisely 
all gene tree nodes Uj where lca(Mj) is in any of the inter- 
vals . . . , T n -\- Since at the present, n lineages are able 
to coalesce, we can express^- as, 

n — 1 yi — 1 

gi = n- EI 7 ( r ( lca («*)) > r,) (8) 
;=i+l k=j 

where ti < r; iff j < i, and where /(•) is an indicator 
function taking the value 1 if the condition holds and oth- 
erwise 0. Assuming each lca() operation is 0(1) [29,30], 
preprocessing allows all lea terms to be computed in 0(«) 
time. Thus, calculating^!, . . . ,g n -\ can be done, based on 
Equation 8, in 0(« 3 ). 

Calculation ofkjj iZ 

We let ytj be the ;'th population (read left to right) in 
interval T; (equivalently, the /th branch or /th node sub- 
tending the branch). In order to label every population 
before and after a speciation time s,- uniquely, extra nodes 
can be added to the species tree to form a beaded species 
tree (Figure 2), so that there are i nodes at time su i = 
1, ...,« — 1. For each { e {1, — 1}, there is one node 
of outdegree 2, and i — 1 nodes of outdegree 1. Thus, pop- 
ulation yu corresponds to a branch (equivalently, a node) 
in the beaded species tree. We denote the outdegree of a 
node y by outdegiy). 

In the remainder of this section, we compute the val- 
ues kij iZ , i.e. the number of lineages on branch y^ of the 
beaded species tree during the interval immediately after 
the /th coalescence event (going forward in time), with 




Figure 2 The beaded version of the species tree topology in 
Figure 1a-d. 
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ki t o iZ being the number of lineages "exiting" the branch at 
time s,_i. For example, in Figure lb, we have 

^2,0,1 = 1> ^2,1,1 = 2, £2,2,1 = 3, 
£2,0,2 = li £2,1,2 = li £2,2,2 = 1) 

The value of £y iZ depends on the number of lineages 
entering branch i, l\, as well as the number of lineages 
exiting the branch, and not just on the number of coa- 
lescence events in the interval. For example, in Figure lc, 
^"2,0,1 = 1 and £2,1,1 = 2, while in Figure Id, £2,0,1 = 2 
and £2,1,1 = 3, although the two gene trees have the same 
ranked topology and mi = 1 for both cases. 

To determine the terms £ ; y jZ we note that the number of 
coalescences that have occurred more recently than inter- 
val %i is n — li. In a given interval r;, we let and z (2) be 
the left and right children, respectively, of population z of 
outdegree 2, and let = z^ be the only child of a node 
z of outdegree 1. 

The number of lineages available to coalesce in popula- 
tion z of interval T; is 

outdeg(yi x ) 

h,mt,z = E ^<+l,0,z« ( 9 ) 
/'=1 

where the are the daughter populations (one or two) 
of z. Further, £ Mj o, z = 0 for all z. Since the beaded species 
tree has n 2 /2 nodes, precalculating outdeg(yi iZ ) requires 
0(« 2 ). For 0 <j < mi, we have 

h,j+i,z — 1 /th coalescence on branch z 

ki j z 

h,j+\,z otherwise 

(10) 

Consequently, determining a particular kij }Z is 0(1). Thus 
determining £y jZ for all possible i, mi and li is (see also 
Corollary 4), 

(n-l n ti-i m t \ 
E E EE 
i=2 li=i+l m,=0 ;=0 / 

= O (« 4 ). 

Note that taking the sum over all z is not necessary, as in 
all but one branch the ky >z equals the £i,;+i, z . 

An algorithm 

In summary, we derived an algorithm with runtime 0(w 5 ) 
for calculating the probability of a ranked gene tree given 
a species tree on n tips: 

1. Calculate gi, . . -g n -i using Equation (8). 

2. Calculate Ar (>; ', z (for i,j = 1, . . . , n; z = 1 . . . i), using 
Equations (9) and (10). 



3. Calculate A./ ( y = J2z=l C'i*) ^ or = 1> • • • > w )- 

4. Calculate P^-i,^ \%, U , T] (for i = %..., n; 
li-i = gi-i, ...,n;li=gi,..., «), using Theorem 3. 

5. Calculate P[ & Ul \T\ using Theorem 2. 

6. Calculate P[^# | T] using Theorem 1. 

Conclusions 

In this paper, we provide a polynomial-time algorithm 
(0(m 5 ) where n is the number of species) to calculate the 
probability of a ranked gene tree topology given a species 
tree, summarized in Section 'An algorithm'. We now dis- 
cuss applying these results to computing probabilities of 
unranked gene tree topologies and to inferring ranked 
species trees. 

Computing probabilities of unranked gene tree topologies 

Previous work on computing probabilities of unranked 
gene tree topologies used the concept of coalescent his- 
tories, which specify the branches in the species tree in 
which each node of the gene tree occurs. An unranked 
gene tree probability can then be computed by enumerat- 
ing all coalescent histories and computing the probability 
of each. The number of coalescent histories grows at least 
exponentially when the (unranked) gene tree matches the 
species tree, making this approach computationally inten- 
sive. Coalescent histories can be enumerated either recur- 
sively (e.g., in PHYLONET [31] or [20]) or nonrecursively 
(COAL [19]). 

A much faster approach using dynamic programming 
similar to that used in this paper is implemented in 
STELLS [6], which conditions on the ancestral configura- 
tion in each branch rather than the number of lineages. 
Here an ancestral configuration keeps track not only of the 
number of lineages in a branch in the species tree, but also 
the particular nodes of the gene tree. Different ancestral 
configurations can potentially have the same number of 
lineages within a population. Enumerating ancestral con- 
figurations turns out to have exponential running time for 
arbitrarily shaped trees, but the number of ancestral con- 
figurations is still much smaller than the number of coa- 
lescent histories. When computing probabilities of ranked 
gene tree topologies, however, the ranking specifies the 
sequence of coalescence events, leading to a unique ances- 
tral configuration given the number of lineages in a time 
interval. This fortuitously enables probabilities of ranked 
gene tree topologies to be computed in polynomial time. 

We note that although the number of rankings for a 
gene tree is not polynomial in the number of leaves in 
general, the number of rankings can be small for certain 
tree shapes. For example, if the gene tree has a caterpillar 
shape, in which each internal node has a leaf as a descen- 
dant, then there is only one ranking, and thus computing 
the ranked and unranked gene tree are equivalent. For 
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a pseudo-caterpillar, a tree made by replacing the sub- 
tree with four leaves of a caterpillar with a balanced tree 
on four leaves [20], there are only two rankings possible, 
and for a bicaterpillar [20], for which the left subtree is a 
caterpillar with ni leaves and the right subtree is a cater- 
pillar with n — ni leaves, there are („"~ 2 i) rankings. Thus 
computing unranked gene tree probabilities by summing 
ranked gene tree probabilities can be done in polynomial 
time for some tree shapes. We note that for the approach 
used by STELLS, some tree shapes can also be computed 
in polynomial time, including the cases we mentioned 
with a polynomial number of rankings (caterpillar and 
pseudo-caterpillar). An open question is whether there are 
any classes of unranked gene trees which have a polyno- 
mial number of rankings but an exponential number of 
ancestral configurations, or vice versa. 

Inferring species trees from ranked gene trees 

Our fast calculation of the probability of ranked gene tree 
topologies can be used to determine the maximum likeli- 
hood species tree from a collection of known gene trees. 
Assume we have observed N ranked gene trees (i.e., N 
loci). Now the maximum likelihood species tree 7ml (with 
branch lengths on internal branches) is 

Tml = argmax P[Sft, . . . ,& N \T] 
T 

where 

N H„ 

n&i, . . .,&n\t\ = n p[s&in = n n^ (i) \Tr 

k=l (=1 

(11) 

is a multinomial likelihood. Here P[^|T] can be deter- 
mined with our polynomial-time algorithm, we let Sf^* 
denote the z'th ranked topology, and is the number of 
times ranked topology i is observed, with J2i=i n i = N. 
Note in particular that the ranked topology of Tml might 
differ from the most frequent ranked gene tree topology 
[18]. 

Our derivation of the ranked gene tree probability also 
suggests a way to infer a ranked species tree topology from 
ranked gene tree topologies with a similar flavor as the 
MDC criterion. In MDC, for an input gene tree and can- 
didate species tree, the number of extra lineages (lineages 
which necessarily fail to coalesce due to topological dif- 
ferences between gene and species trees) on each edge of 
the species tree is counted. For MDC, whether the edge of 
the species tree is long or short does not affect the deep 
coalescence cost. In working with ranked gene trees, how- 
ever, we can keep track of the minimum number of extra 



lineages within each time interval r,-. The total number of 
extra lineages in this sense is 

n-l 

X>-a+D (12) 

Minimizing (12) as a criterion for the ranked species tree 
will tend to penalize long edges of the species tree which 
have multiple lineages persisting through multiple species 
divergence events. As an example, in Figure lb, the gene 
tree has a MDC cost of 1 since there are two lineages 
exiting the population immediately ancestral to A and B; 
however the cost according (12) is 2 because there are two 
edges on the beaded version of the species tree (Figure 2) 
that each have an extra lineage. In Figure lc, the gene 
tree has a MDC cost of 0 for the species tree since it has 
the matching unranked topology; however, the number 
of extra lineages from equation (12) is 1. We note that 
in Figure lc, interval T3, incomplete lineage sorting (and 
deep coalescence) have not occurred as these concepts are 
normally used. To capture the idea that coalescence has 
nevertheless occurred in a more ancient time interval than 
allowed, we might refer to the coalescence of A and B in 
Figure lc as an "ancient lineage sorting" event (rather than 
incomplete lineage sorting event) or an ancient coales- 
cence rather than a deep coalescence. We could therefore 
refer to minimizing equation (12) as the Minimize Ancient 
Coalescence (MAC) criterion, which would provide an 
interesting comparison to the usual topology-based MDC 
criterion. 

In practice, a method of inferring a species tree from 
ranked gene trees would require estimating the ranked 
gene trees. This would require clock-like gene trees, or 
trees with times estimated for nodes, which can also be 
inferred under relaxed clock models in BEAST [32]. To 
account for the uncertainty in the gene trees, the counts 
for different ranked gene trees could be weighted by their 
posterior probabilities obtained from Bayesian estimation 
of the gene trees [33]. Thus, in equation (11), we would 
let npc be the posterior probability of ranked topology i at 

locus k, and use m = Ylk=l n ' k as tne estrmate d number 
of times that ranked topology i was observed. Similarly, 
for equation (12), the coalescence cost at a locus could 
be distributed over multiple topologies weighted by their 
posterior probabilities. 
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